Conversation

ddh0 (Contributor) commented Oct 15, 2025

Add support for the zai-org/GLM-4.5V vision model to llama.cpp. I currently plan to support only images + text; no video inputs in this PR.

The architecture is Glm4vMoeForConditionalGeneration ("model_type": "glm4v_moe"). Internally, this consists of an LLM (text model) and a ViT (vision adapter / multimodal projector):

LLM (text model glm4v_moe_text)

  • Based on GLM-4.5-Air
  • Tensor names start with model.language_model.
  • Uses a "multimodal 3D RoPE": in apply_multimodal_rotary_pos_emb, rotary embeddings are applied across the temporal, height, and width dimensions for visual tokens (see the sketch after this list)
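
A minimal sketch of how such a 3D ("mrope"-style) rotary application works, modeled on the analogous Qwen2-VL helper in transformers; the section sizes and tensor shapes below are illustrative assumptions, not values verified against the GLM-4.5V implementation:

    import torch

    def rotate_half(x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    def apply_multimodal_rope_sketch(q, k, cos, sin, mrope_section):
        # q, k:     (batch, n_heads, seq, head_dim)
        # cos, sin: (3, batch, seq, head_dim), one slice per axis: temporal, height, width
        # mrope_section: per-axis split of the rotary dims, e.g. [16, 24, 24]
        sections = mrope_section * 2  # cos/sin repeat the freqs across both halves of head_dim
        cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(sections, dim=-1))], dim=-1).unsqueeze(1)
        sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(sections, dim=-1))], dim=-1).unsqueeze(1)
        q_embed = (q * cos) + (rotate_half(q) * sin)
        k_embed = (k * cos) + (rotate_half(k) * sin)
        return q_embed, k_embed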

ViT (vision adapter glm4v_moe)

  • Adapted from apple/aimv2-huge-patch14-336:
    • Architecture Aimv2VisionModel
    • ~681M params
    • 24 layers
    • hidden_size (n_embd): 1536
    • intermediate_size (n_ff): 4096
    • image_size: 336
    • patch_size: 14
    • num_channels: 3
    • depth: 24
  • Tensor names start with model.visual.
  • Its 2D positional embeddings are dynamically adapted via bicubic interpolation within the Glm4vMoeVisionEmbeddings module to handle varied image resolutions (a sketch of this step follows the list)
  • It also applies its own rotary position embeddings within the self-attention blocks (via apply_rotary_pos_emb_vision)
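
A minimal sketch of that bicubic interpolation step; the function name and shapes are assumptions for illustration, not the actual Glm4vMoeVisionEmbeddings code:

    import torch
    import torch.nn.functional as F

    def interpolate_pos_embed(pos_embed: torch.Tensor, h_patches: int, w_patches: int) -> torch.Tensor:
        # pos_embed: (1, N, C) learned at the native grid, here 336 / 14 = 24, so N = 576
        n, c = pos_embed.shape[1], pos_embed.shape[2]
        side = int(n ** 0.5)
        grid = pos_embed.reshape(1, side, side, c).permute(0, 3, 1, 2)        # (1, C, 24, 24)
        grid = F.interpolate(grid, size=(h_patches, w_patches),
                             mode="bicubic", align_corners=False)
        return grid.permute(0, 2, 3, 1).reshape(1, h_patches * w_patches, c)  # (1, H*W, C)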

Other notes:

  • Native context length is 65,536 (as opposed to 131,072 for GLM-4.5-Air)
  • RoPE theta (θ): 10,000.0 (as opposed to 100,000.0 for GLM-4.5-Air); a conversion-time sketch for these two values follows these notes
  • The model itself supports video input, but this PR does not (images only)
  • The tokenizer has video-related special tokens; these need to be handled during conversion
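
For the text model, these two values would end up in the GGUF metadata at conversion time, roughly along these lines (a hedged sketch; the exact hparam keys and fallbacks are assumptions, not final code):

    # inside the GLM-4.5V text-model class in convert_hf_to_gguf.py (sketch only)
    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        self.gguf_writer.add_context_length(self.hparams.get("max_position_embeddings", 65536))
        self.gguf_writer.add_rope_freq_base(self.hparams.get("rope_theta", 10000.0))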

References:

See also:

ddh0 (Contributor, Author) commented Oct 17, 2025

So, it turns out that vision in this model is based on Qwen3-VL, which itself is not yet supported in llama.cpp. I am pretty familiar with llama.cpp in general but not with mtmd, so I may not be able to get this PR done on my own. I will keep trying to hack at it when I have time, and I would appreciate any help I can get. :)

Also just saw this thread (#16207) in which someone has posted a patch to get Qwen3-VL kinda-sorta-working in llama.cpp. I will take a look at that too and see if it is helpful - it might make more sense to get Qwen3-VL to a working state in llama.cpp first and only then start working on this PR on top of that. Not sure, just thinking out loud.

rujialiu commented

Thanks for your work, @ddh0!
Based on the commit history, the qwen3vl imports are the result of a "use the Qwen data classes to avoid repeating them" refactor, so it's probably not quite "based on Qwen3-VL". But anyway, I'm planning to dive into Qwen3-VL and GLM-4.5V later this month, and I hope I can help.

e1732a364fed commented Oct 18, 2025

Here's my silly implementation for the unfinished mmproj part (class GLM4VMoEVisionModel(MmprojModel):):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert self.has_vision_encoder
        assert self.hparams_vision is not None

        # the vision config uses "num_heads" / "depth" rather than the usual key names
        self.hparams_vision["num_attention_heads"] = self.hparams_vision.get("num_heads")
        self.hparams_vision["num_hidden_layers"] = self.hparams_vision.get("depth")

    def set_gguf_parameters(self):
        # keep ddh0's existing set_gguf_parameters() code here as-is
        ...

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
        del bid  # unused
        if not name.startswith("model.visual."):
            return []  # only vision tensors go into the mmproj file

        name = name.replace("model.visual.", "visual.", 1)

        if ".qkv." in name:
            # split the fused qkv projection into separate q / k / v tensors
            if data_torch.ndim == 2:  # weight: (3 * c, n_embd)
                c3, _ = data_torch.shape
            else:  # bias: (3 * c,)
                c3 = data_torch.shape[0]
            assert c3 % 3 == 0
            c = c3 // 3
            wq = data_torch[:c]
            wk = data_torch[c:c * 2]
            wv = data_torch[c * 2:]
            return [
                (self.map_tensor_name(name.replace("qkv", "q")), wq),
                (self.map_tensor_name(name.replace("qkv", "k")), wk),
                (self.map_tensor_name(name.replace("qkv", "v")), wv),
            ]

        if name.startswith("visual.downsample."):
            # placeholder: mapped to V_POST_NORM only because that is what other code did;
            # the downsample conv almost certainly needs its own tensor type
            suffix = name.split(".", 2)[2]  # "weight" or "bias"
            new_name = self.format_tensor_name(gguf.MODEL_TENSOR.V_POST_NORM, suffix=f".{suffix}")
            return [(new_name, data_torch)]

        return [(self.map_tensor_name(name), data_torch)]

Then edit gguf-py/gguf/tensor_mapping.py (TensorNameMap) to add the following source names (a sketch of what such an entry looks like follows the list):

visual.embeddings.position_embedding
visual.merger.proj.weight
visual.merger.up_proj.weight
visual.merger.gate_proj.weight
visual.merger.down_proj.weight
visual.merger.post_projection_norm
visual.post_conv_layernorm.weight
visual.post_layernorm.weight
visual.merger
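
Each of these becomes an extra source-name entry in the tuple for whichever MODEL_TENSOR it should map to. A hedged sketch of the shape of such an edit (the target member and the pre-existing entry shown are illustrative; they need to be matched to the real tensor types):

    # gguf-py/gguf/tensor_mapping.py, inside TensorNameMap's mapping tables (sketch only)
    MODEL_TENSOR.V_POST_NORM: (
        "vision_tower.vision_model.post_layernorm",  # existing entry, kept
        "visual.post_layernorm",                     # GLM-4.5V (new)
    ),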

If you put these in the right place, running

python convert_hf_to_gguf.py /path/to/GLM-4.5V --outfile /path/to/GLM-4.5V-mmproj.gguf --mmproj

will succeed.

But this is as far as I can help. My lack of knowledge of the model prevents me from digging further, and the converted mmproj won't work until we do it right.

And I believe my treatment of visual.downsample is completely wrong... it's just copied from other code... I don't know what I'm doing...

Maybe refer to
https://huggingface.co/yairpatch/Qwen3-VL-30B-A3B-Thinking-GGUF/blob/main/qwen3vl-implementation.patch
